Briefing on the Modern ML Stack with R

Javier Luraschi, RStudio

Overview

Why R?

Modern R

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

About RStudio

RStudio’s Multiverse Team

Authors of R packages to support Apache Spark, TensorFlow and MLflow. Contributors to tidyverse and Apache Arrow.

The Modern ML Stack with R

Spark

Motivation

In an ideal world, all R packages work with Spark, like magic. Such is the case for dplyr and sparklyr.
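As a rough illustration of the dplyr case, the sketch below runs ordinary dplyr verbs against a Spark table. It assumes a local Spark installation (for example one set up with sparklyr::spark_install()). The mtcars dataset and the names cars_tbl and mpg_by_cyl are illustrative choices, not taken from the talk.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally and reachable in "local" mode.
sc <- spark_connect(master = "local")

# Copy a small built-in dataset into Spark (illustrative data only).
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# Plain dplyr verbs; sparklyr translates them to Spark SQL behind the scenes.
mpg_by_cyl <- cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()  # bring the small aggregated result back into R

mpg_by_cyl
```

The same pipeline would run unchanged on a local data frame, which is the sense in which dplyr works with Spark "like magic".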

Timeline

Master Spark with R

What’s new? - Arrow

What’s new? - XGBoost

sparkxgb is a new sparklyr extension that can be used to train XGBoost models in Spark.

#> Observations: ??
#> Variables: 5
#> Database: spark_connection
#> $ Species                <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ predicted_label        <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ probability_versicolor <dbl> 0.003566429, 0.003564076, 0.003566429, 0.…
#> $ probability_virginica  <dbl> 0.001423170, 0.002082058, 0.001423170, 0.…
#> $ probability_setosa     <dbl> 0.9950104, 0.9943539, 0.9950104, 0.995010…
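Output of this shape can be produced with a call along the following lines. This is a sketch modeled on the sparkxgb package's documented usage, not necessarily the exact code from the slide. The hyperparameter values are illustrative, and it assumes a local Spark version that sparkxgb supports.

```r
library(sparkxgb)
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation compatible with sparkxgb.
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)

# Train a multiclass XGBoost model in Spark.
# num_round and max_depth are illustrative values.
xgb_model <- xgboost_classifier(
  iris_tbl,
  Species ~ .,
  num_class = 3,
  num_round = 50,
  max_depth = 4
)

# Score the training table and keep the label and probability columns.
predictions <- xgb_model %>%
  ml_predict(iris_tbl) %>%
  select(Species, predicted_label, starts_with("probability_"))

glimpse(predictions)
```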

What’s new? - Broom

broom summarizes key information about models as data frames. The latest sparklyr release completes broom support across all of its modeling functions.

# Source: spark<?> [?? x 4]
   user  item rating .prediction
  <dbl> <dbl>  <dbl>       <dbl>
1     2     2      5        4.86
2     1     2      4        3.98
3     0     0      4        3.88
4     2     1      1        1.08
5     0     1      2        2.00
6     1     1      3        2.80
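A table like the one above can be obtained by calling augment() on a fitted Spark model. The sketch below uses a small, made-up ratings data frame and an ALS recommender. The data and the names movies and scored are illustrative, and a local Spark installation is assumed.

```r
library(sparklyr)
library(broom)
library(dplyr)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")

# Tiny illustrative ratings dataset.
movies <- data.frame(
  user   = c(1, 2, 0, 1, 2, 0),
  item   = c(1, 1, 1, 2, 2, 0),
  rating = c(3, 1, 2, 4, 5, 4)
)

# Fit an ALS model in Spark, then use broom's augment() to append
# a .prediction column to the original observations.
scored <- copy_to(sc, movies, overwrite = TRUE) %>%
  ml_als(rating ~ user + item) %>%
  augment()

scored
```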

What’s new? - TF Records

sparktf is a new sparklyr extension allowing you to write TensorFlow records in Spark. This can be used to preprocess large amounts of data before processing them in GPU instances with Keras or TensorFlow.
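A minimal sketch of writing TensorFlow records from Spark, following the usage documented for sparktf. The iris dataset, the label encoding step, and the temporary output path out_path are illustrative assumptions. A local Spark installation is assumed.

```r
library(sparktf)
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally.
sc <- spark_connect(master = "local")

# Illustrative output location; any path writable by Spark would do.
out_path <- tempfile("iris-tfrecord-")

copy_to(sc, iris, overwrite = TRUE) %>%
  # Encode the string label as a numeric index before writing.
  ft_string_indexer_model(
    "Species", "label",
    labels = c("setosa", "versicolor", "virginica")
  ) %>%
  # Write the table out as TensorFlow records.
  spark_write_tfrecord(path = out_path)
```

The resulting files can then be read on a GPU instance with the TensorFlow or Keras data-loading tools.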

What’s new? - VariantSpark

VariantSpark is a framework based on Scala and Spark for analyzing genome datasets. It is being developed by the CSIRO Bioinformatics team in Australia. VariantSpark was tested on datasets with 3000 samples, each containing 80 million features, in both unsupervised clustering approaches and supervised applications such as classification and regression.

What’s new? - SparkHail

Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS).

What’s new? - GitHub

New github.com/r-spark organization to support the ecosystem of Spark extensions for R.

What’s next? - SparkNLP

Spark NLP: State of the Art Natural Language Processing. The first production-grade versions of the latest deep learning NLP research.